Analyzing COVID-19 disinformation on Twitter using the hashtags #scamdemic and #plandemic: Retrospective study

Introduction The use of social media during the COVID-19 pandemic has led to an "infodemic" of mis- and disinformation with potentially grave consequences. To explore means of counteracting disinformation, we analyzed tweets containing the hashtags #Scamdemic and #Plandemic.
Methods Using a Twitter scraping tool called twint, we collected 419,269 English-language tweets that contained “#Scamdemic” or “#Plandemic” posted in 2020. Using the Twitter application-programming interface, we extracted the same tweets (by tweet ID) with additional user metadata. We explored descriptive statistics of tweets including their content and user profiles, analyzed sentiments and emotions, performed topic modeling, and determined tweet availability in both datasets.
Results After removal of retweets, replies, non-English tweets, or duplicate tweets, 40,081 users tweeted 227,067 times using our selected hashtags. The mean weekly sentiment was overall negative for both hashtags. One in five users who used these hashtags were suspended by Twitter by January 2021. Suspended accounts had an average of 610 followers and an average of 6.7 tweets per user, while active users had an average of 472 followers and an average of 5.4 tweets per user. The most frequent tweet topic was “Complaints against mandates introduced during the pandemic” (79,670 tweets), which included complaints against masks, social distancing, and closures.
Discussion While social media has democratized speech, it also permits users to disseminate potentially unverified or misleading information that endangers people’s lives and public health interventions. Characterizing tweets and users that use hashtags associated with COVID-19 pandemic denial allowed us to understand the extent of misinformation. With the preponderance of inaccessible original tweets, we concluded that posters were in denial of the COVID-19 pandemic and sought to disperse related mis- or disinformation resulting in suspension.
Conclusion Leveraging 227,067 tweets with the hashtags #scamdemic and #plandemic in 2020, we were able to elucidate important trends in public disinformation about the COVID-19 vaccine.


Introduction
In 2021, almost four billion people were users of social media with the average user managing more than eight accounts on various social media platforms [1]. One such platform is Twitter, which has over 199 million daily monetizable active users and allows individuals to post, repost, like, and comment on 'tweets' of up to 280 characters that may include links, videos, or images. The vast majority of the posts are public [2].
Social media can be the source of several types of false information: misinformation, disinformation, and malinformation. Misinformation is false information not intended to harm. Disinformation is also false but carries the intent to harm. Malinformation represents genuine information intended to harm and may include leaks, harassment, and hate speech [3]. For our Twitter analysis, we selected two hashtags that represent mis- and disinformation (#plandemic and #scamdemic) to analyze the effect of false information.
The analysis of Twitter content has been used previously within the public health realm to understand public sentiment and gauge opinion on topics such as diabetes, the Affordable Care Act [4], social distancing [5], influenza [6], and measles [7]. Twitter may serve as a robust medium to better understand wide-scale, organic public perception about the COVID-19 pandemic [3,8,9]. Social media use during the COVID-19 pandemic has led to an "infodemic" generating mis- and disinformation with potentially grave consequences [10,11]. Starting in 2021, Twitter began applying labels to tweets that potentially contained misleading information about COVID-19. Twitter applied this new labeling policy to limit tweet visibility and spread of mis- and disinformation. Twitter mandated tweet removal across 11.5 million accounts and permanently suspended over 150,000 accounts for distributing misinformation [2,12].
The hashtags #scamdemic and #plandemic, which imply that the pandemic is a conspiracy, are frequently associated with intentional disinformation; however, tweets with these hashtags have not been examined to explore the scope of disinformation [13]. Understanding the extent and impact of false information is important for officials and public health agencies to predict population behavior including the potential uptake of vaccines and non-pharmaceutical measures such as masking and social distancing. Our hypothesis was that analysis of tweets associated with these hashtags would provide valuable insight about disinformation and the public's beliefs around the COVID-19 pandemic and would aid in developing targeted public health interventions.

Data collection and processing
On January 3, 2021, using the Twitter scraping tool Twint, we collected English-language tweets that contained the hashtags "#scamdemic" or "#plandemic" and were posted between January 1 and December 31, 2020. Subsequently on January 15, 2021, we used the Twitter application programming interface (API) to extract the same tweets (using the corresponding tweet IDs) to collect additional relevant metadata. We provided descriptive statistics for tweets including user profiles and tweet content and determined tweet availability in both datasets based on Twitter API status codes (User has been suspended or No status found with that User ID). We used Python version 3.9.1 software (Python Software Foundation, Wilmington, DE) for all data processing and analyses. Institutional review board approval was not required because this study used only publicly available data.
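As an illustration of how tweet availability could be derived from the API responses, the following minimal Python sketch maps the two messages quoted above to availability labels. The input format (a mapping from tweet ID to an error message, or None when the tweet was returned), the function names, and the label names are assumptions made for illustration; they are not taken from the study code, and the exact strings returned by the API may differ.

```python
from collections import Counter

# Message fragments quoted in the text; the exact API wording is an assumption.
SUSPENDED_MSG = "user has been suspended"
NOT_FOUND_MSG = "no status found"


def classify_availability(api_message):
    """Map one API lookup result to an availability label.

    `api_message` is None when the tweet was returned normally, otherwise
    the error text attached to that tweet ID (hypothetical input format).
    """
    if api_message is None:
        return "active"
    message = api_message.lower()
    if SUSPENDED_MSG in message:
        return "suspended"
    if NOT_FOUND_MSG in message:
        return "not_found"
    return "other"


def availability_counts(lookup_results):
    """Count availability labels for a {tweet_id: api_message} mapping."""
    return Counter(classify_availability(msg) for msg in lookup_results.values())
```

Matching on lowercase fragments rather than on full strings keeps the sketch tolerant of small differences in punctuation or wording between API versions.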

Sentiment & subjectivity and emotion analysis
To perform sentiment analysis, we tokenized the tweets, then cleaned the tokens and transformed them into their root form through natural language processing techniques such as stemming, lemmatizing, and removal of stop words. We used Python's VADER library to identify and classify the sentiment (positive, negative, or neutral) of tweets [14]. VADER applies a rule-based sentiment analysis with a polarity scale of −1 (most negative) to 1 (most positive).
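The tokenization and stop-word removal described above can be sketched with the Python standard library alone. This is a simplified illustration, not the study's pipeline: the stop-word list is a tiny hypothetical sample, the removal of URLs and user mentions is an assumption, and stemming and lemmatizing are omitted because they require an additional NLP library.

```python
import re

# Tiny illustrative stop-word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "this", "it"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[a-z0-9']+")


def clean_tokens(tweet):
    """Lowercase a tweet, strip URLs and mentions, and drop stop words."""
    text = tweet.lower()
    text = URL_RE.sub(" ", text)       # assumption: links carry no sentiment
    text = MENTION_RE.sub(" ", text)   # assumption: user mentions are dropped
    text = text.replace("#", " ")      # keep the hashtag word, drop the symbol
    tokens = TOKEN_RE.findall(text)
    return [token for token in tokens if token not in STOP_WORDS]
```

Keeping the hashtag word while dropping the "#" symbol lets terms such as "plandemic" remain available to later steps such as topic modeling.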
For the subjectivity analysis, we used TextBlob to score each tweet on a scale from 0 (objective) to 1 (subjective). Objective tweets relay facts, whereas subjective tweets typically communicate an opinion or belief. For the two hashtags #plandemic and #scamdemic, we visualized the distribution of subjectivity scores using a histogram.
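A minimal sketch of how the polarity and subjectivity scores could be obtained and turned into labels is shown below. The ±0.05 threshold on the VADER compound score is the convention suggested in the VADER documentation; whether the study used it is not stated, and the 0.5 subjectivity cutoff and all function names are assumptions made for illustration. The scoring function imports vaderSentiment and textblob lazily, so the labeling helpers can be used without those packages installed.

```python
def label_sentiment(compound, threshold=0.05):
    """Label a VADER compound score (range -1 to 1)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def label_subjectivity(score, cutoff=0.5):
    """Label a TextBlob subjectivity score (range 0 to 1); cutoff is assumed."""
    return "subjective" if score > cutoff else "objective"


def score_tweet(text):
    """Score one tweet; requires the vaderSentiment and textblob packages."""
    from textblob import TextBlob
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    compound = SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
    subjectivity = TextBlob(text).sentiment.subjectivity
    return {
        "compound": compound,
        "sentiment": label_sentiment(compound),
        "subjectivity": subjectivity,
        "subjectivity_label": label_subjectivity(subjectivity),
    }
```

In a larger run, the analyzer object would normally be created once and reused across tweets rather than rebuilt inside the function.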
We used the Python library NRCLex to label the primary emotion for each tweet (fear, anger, anticipation, trust, surprise, positive, negative, sadness, disgust, or joy) [15].

Topic modeling
To identify the major topics discussed in our tweet library, we used the Gensim library in Python and applied an unsupervised machine-learning algorithm called Latent Dirichlet Allocation (LDA), which identifies clusters of tweets by a representative set of words [16].
We used the most highly weighted words in each cluster to determine the content of each topic. To find the optimal number of topics required by LDA, we trained several LDA models using different numbers of topics ranging from 2 to 100 and computed a topic coherence score (produced by evaluating the relative distance between the topics' most highly weighted words) for each LDA model. We ultimately chose a twelve-topic LDA model as it maximized the coherence score. One author without access or insight into the topic model labeled the topics using the 30 most frequently used terms ordered by weight. All authors then evaluated these topic labels and reached a consensus.
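The search for the number of topics can be sketched as follows, assuming tokenized tweets are available as lists of strings. The Gensim calls shown (Dictionary, LdaModel, CoherenceModel) exist in the library, but the choice of the "c_v" coherence measure, the random seed, and the function names are assumptions; the text does not specify which coherence measure or training settings were used.

```python
def coherence_sweep(tokenized_tweets, candidate_ks, seed=0):
    """Train one LDA model per candidate topic count and score its coherence.

    Requires the gensim package. Returns {num_topics: coherence_score}.
    """
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(tokenized_tweets)
    corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_tweets]
    scores = {}
    for k in candidate_ks:
        lda = LdaModel(corpus=corpus, id2word=dictionary,
                       num_topics=k, random_state=seed)
        coherence = CoherenceModel(model=lda, texts=tokenized_tweets,
                                   dictionary=dictionary, coherence="c_v")
        scores[k] = coherence.get_coherence()
    return scores


def best_num_topics(coherence_by_k):
    """Return the topic count with the highest coherence score."""
    return max(coherence_by_k, key=coherence_by_k.get)
```

Under these assumptions, a call such as best_num_topics(coherence_sweep(tokens, range(2, 101))) would reproduce the kind of sweep described above, at the cost of training one model per candidate value.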

Demographics
Using m3 inference, we obtained user demographics including gender, age group, and type of account [17]. To obtain the ethnicity, we used the ethnicolr library in Python to predict the ethnicity of the user [18].
Twitter Web App was the most used platform among active (32.6%) and suspended (31.4%) users, followed by Twitter for iPhone (28.2% and 29.0%, respectively). Less than 20% of tweets had media (image or video), and about one-quarter of tweets contained a URL. The median active user had over 8,000 posts and 470 followers, and the median suspended user had over 12,000 posts and 610 followers. None of the users who tweeted the selected hashtags had their identity verified (blue checkmark) by Twitter. Table 1 shows the demographics of Twitter users including age, gender, and ethnicity. Non-Hispanic Black users were significantly more likely to be suspended than active (11.3% vs 9.7%, P < 0.001), whereas Hispanic users were significantly less likely to be suspended (3.2% vs 5.1%, P < 0.001).
The largest group of users was 40 years or older. Males and non-Hispanic Whites represented the largest groups (Table 1). Male users and users in the age groups ≤18 years and 30-39 years were significantly overrepresented among the suspended users. The vast majority of active and suspended users tweeted from personal accounts (88.2% and 79.4%, respectively).
We listed the characteristics of tweets in Table 2. Among all tweets, suspended tweets were significantly more likely to have likes (P < 0.001) and retweets (P < 0.001) compared to active tweets. The average number of hashtags per tweet was three (range 1-5), except that active accounts using #scamdemic averaged two per tweet.

Emotion analysis
In the analysis of emotions expressed in the tweets, fear was the most common emotion followed by trust, sadness, and anger. Disgust, surprise, and joy were least expressed (Fig 3). Suspended tweets were statistically more likely to express anger, disgust, and surprise.

Sentiment analysis
The overall sentiment for #plandemic and #scamdemic was negative, as noted in Fig 3. The mean weekly sentiments for #plandemic and #scamdemic were negative throughout the study period (Fig 4), with overall mean sentiments of −0.05 and −0.09, respectively (−1 denotes completely negative, 1 completely positive). During the week of May 4, 2020, the movie Plandemic: Indoctornation [19] was released, after which the polarity for both hashtags became more negative for several weeks. During the week of the United States election, there was a slight uptick in the mean polarity towards neutral, but following the election, the mean polarity became more negative for both hashtags, and for the first time, the mean polarity of #plandemic was more negative than that of #scamdemic.
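The weekly aggregation behind such a trend can be illustrated with a short standard-library sketch. Grouping by ISO calendar week is an assumption, since the text does not state how weeks were delimited, and the input format (pairs of posting date and polarity score) and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def weekly_mean_sentiment(scored_tweets):
    """Average polarity scores per ISO week.

    `scored_tweets` is an iterable of (datetime.date, polarity) pairs.
    Returns {(iso_year, iso_week): mean_polarity}, ordered by week.
    """
    buckets = defaultdict(list)
    for day, polarity in scored_tweets:
        iso = day.isocalendar()          # (ISO year, ISO week, weekday)
        buckets[(iso[0], iso[1])].append(polarity)
    return {week: mean(scores) for week, scores in sorted(buckets.items())}
```

Running the function separately on the tweets of each hashtag would yield the two weekly series needed for a plot of mean polarity over time.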

Topic modeling
LDA identified 12 topics in our tweet collection, and we subjectively labeled them based on the predominant keywords (Table 4). The content of tweets was almost exclusively (>99%) representative of a single topic. The most frequent tweet topic was "Complaints against mandates introduced during the pandemic" (79,670 tweets), which included complaints against masks, social distancing, and closures, and had the highest percentage of suspended tweets. The next most popular topics were "Downplaying the dangers of COVID-19" (23,185 tweets), "Lies and brainwashing by the media and politicians" (18,871 tweets), and "Corporations and global agenda" (15,493 tweets). Overall, topics had tweet suspension rates ranging from 16.6% to 36% (Table 4).
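Assigning each tweet to its highest-weighted topic and computing a suspension rate per topic can be sketched as below. The input format, a list of (topic_id, weight) pairs per tweet as returned by Gensim's get_document_topics, together with a suspension flag, is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def dominant_topic(topic_weights):
    """Return the topic ID with the largest weight for one tweet."""
    return max(topic_weights, key=lambda pair: pair[1])[0]


def suspension_rate_by_topic(tweets):
    """Compute the share of suspended tweets within each dominant topic.

    `tweets` is an iterable of (topic_weights, is_suspended) pairs.
    Returns {topic_id: suspended_fraction}.
    """
    totals = Counter()
    suspended = Counter()
    for topic_weights, is_suspended in tweets:
        topic = dominant_topic(topic_weights)
        totals[topic] += 1
        if is_suspended:
            suspended[topic] += 1
    return {topic: suspended[topic] / totals[topic] for topic in totals}
```

Because nearly all tweets were representative of a single topic, taking the highest-weighted topic loses little information in this setting.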

Discussion
Social media can be the source of misinformation, disinformation, and malinformation. We analyzed two hashtags that represent mis- and disinformation (#plandemic and #scamdemic) to assess the extent of false information on social media.

Suspended tweets and users
Our observations of tweets for the year 2020 showed that more than 1 in 5 Twitter users (21.6%) who used either of the hashtags #plandemic or #scamdemic during 2020 had their accounts suspended by January 2021. Suspended users were disproportionately more likely to be 18 years old or younger or between 30 and 39 years old. Even though women use Twitter more actively [20], men were more likely to use the selected hashtags in the first place and were significantly overrepresented among the suspended users, which may reflect the finding that men are more likely to use taboo words or topics in tweets [21]. Accounts of non-Hispanic Black users and private individuals (vs. organizations) were disproportionately suspended.
Twitter suspensions have been historically linked to politics as a major theme, as with our hashtags [22]. Suspended tweets were statistically more likely to have likes, media content, and retweets, and they were less likely to have links or mentions. The finding that suspended tweets had fewer links (e.g., to newspaper articles) or mentions suggests that these tweets were less likely to report a fact that readers could verify. Suspended tweets were more likely to be engaging, as indicated by a significantly higher rate of likes and retweets; however, this finding may also be attributable to previously reported communities that spread misinformation [21]. Because suspension on Twitter is usually triggered through crowdsourcing, with users reporting offensive or problematic tweets, tweets whose likes and shares widen their distribution are more likely to be suspended.
Table 3. Example tweets with subjectivity and sentiment scores for each hashtag.

Emotion analysis
The emotions fear, sadness, anger, and disgust were more frequently expressed than joy and surprise. Tweets that expressed emotions linked to fight-or-flight responses such as anger, disgust, and surprise were more likely to be suspended, perhaps because they triggered stronger emotions in readers, resulting in more reporting activity.

Objectivity & sentiment
The objectivity/subjectivity analysis of the tweets showed a predominance of objective tweets. However, we realized that many tweets in our collection were labeled by our tool as objective while their actual meaning was sarcastic. Sarcasm is a sophisticated construct to express contempt or ridicule. Tweets with sarcasm are thus rather subjective in nature [23]. Sarcasm has been shown to be the main reason behind false classification of tweets [24]. Phrasing a tweet in an objective manner does not mean that the content of the tweet is true. While 65% of tweets were labeled as purely objective in nature, they contained mis- and disinformation that was expressed in an objective fashion. Unlike our prior study of general COVID-19-related tweets [4], in which we found a predominantly positive sentiment, the mean sentiments of the tweets in this study were, as expected, more negative. Media events such as the release of the 'Plandemic' movie further negatively affected sentiments.

Topic modeling
Our machine learning approach derived 12 main topics. Three topics were closely related, dealing with anger over pandemic mandates (shutdowns, masks, etc.) and politicians. Two topics focused on the roles of the media and corporations. Another four topics focused on downplaying the dangers of COVID-19 or on the pandemic being a hoax or exaggerated. One standalone topic focused on the censoring of COVID-19 deniers, and two advertised "documentaries" on COVID-19 or distributed vaccine misinformation.

Suspensions
Our analysis of tweets in 2020 with the hashtags #scamdemic or #plandemic provides important insight into the disinformation distributed on Twitter. One surprising finding was the rate at which users of these hashtags were suspended by Twitter. One fifth of the users who used the hashtags had their accounts suspended by January 2021. Twitter allows users to report misleading tweets and to categorize them as health-related and COVID-19-related tweets.

Limitations
Our study was limited by several factors. First, we selected a subset of tweets designed to provide us with tweets containing disinformation. As such, our library of tweets contained many tweets including sarcasm, which limited our ability to use tools we had used in prior studies [4,7]. Second, we used existing tools to analyze sentiments and emotion of tweets that are not specific to health care topics, which could have skewed our analysis. Finally, since we targeted only tweets in English and were unable to determine the geographic location of users, we are limited in drawing conclusions about specific countries or countries where English is not the predominant language.

Potential interventions
Our study demonstrates that it is possible to identify disinformation from tweets. In the future, public health agencies could automate the tools used to identify disinformation in real time and target it with replies that disseminate correct but related educational information. We envision public health "bots" as a means of disarming disinformation spreaders.

Conclusions
Leveraging 227,067 tweets with the hashtags #scamdemic and #plandemic in 2020, we were able to explore topics and user demographics to elucidate important trends in public disinformation about the COVID-19 vaccine. In general, COVID-19 tweets demonstrated overall negative sentiment. Besides expressing anger over pandemic restrictions, a substantial number of tweets were dedicated to presenting disinformation. More than one in five users who used these hashtags in 2020 were suspended by Twitter by January 2021.